This document contains all code that I’ve used to classify the peptide sequences given in the denovo_peptide_bacteria.tsv file. Unless you are interested in the methods used, feel free to skip ahead to the results.

Processing

Initial setup

These tabs are really just so I can keep a record of what I did (and what didn’t work when I was building the kraken2 database).

Setup

These lines of code take the protein sequences from the supplied .tsv file and convert them to a single .fasta file containing every peptide; each peptide is named with a sequential number and given a description taken from the original sample name. The commented-out line can also write an individual .fasta file per sequence (but I found that having almost 1,000,000 individual sequence files was a bit overwhelming).

import pandas as pd
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import IUPAC  #Bio.Alphabet was removed in Biopython 1.78; drop the IUPAC.protein argument below on newer versions

peptides = pd.read_csv('/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/wetransfer-6c2cc0/denovo_peptide_bacteria.tsv', header=0, index_col=0, sep='\t')
names = list(peptides.index.values)
pep_list = list(peptides.loc[:, 'peptide'].values)
sequences = []
for n in range(len(names)):
  record = SeqRecord(Seq(pep_list[n], IUPAC.protein),id='peptide'+str(n+1), description=names[n])
  sequences.append(record)
  #SeqIO.write(record, "/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/all_peptides/peptide"+str(n+1)+".fasta", "fasta")
SeqIO.write(sequences, "/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/peptide_sequences.fasta", "fasta")

Build Kraken2 protein database (unsuccessful attempts)

Note that these were run on a server and not locally; I just have these lines here as a record of the code used. I've had problems downloading the standard kraken2 database, which seem to be linked to having NA's in the NCBI taxonomy. I get the error:

kraken2-build --download-library bacteria --db bac_protein --use-ftp #didn't work
Step 1/2: Performing ftp file transfer of requested files
rsync_from_ncbi.pl: FTP connection error: Network is unreachable

even after making the edits to the rsync_from_ncbi.pl script that were suggested in both of the linked GitHub issues.

Running the standard library build on kronos didn’t work either:

wget http://github.com/DerrickWood/kraken2/archive/v2.0.8-beta.tar.gz
tar -xf v2.0.8-beta.tar.gz
cd kraken2-2.0.8-beta/
./install_kraken2.sh kraken2_dir
cp kraken2/kraken2-2.0.8-beta/kraken2_dir/kraken2 anaconda3/bin/
cp kraken2/kraken2-2.0.8-beta/kraken2_dir/kraken2-build anaconda3/bin/
cp kraken2/kraken2-2.0.8-beta/kraken2_dir/kraken2-inspect anaconda3/bin/
kraken2-build --standard --db path_to_folder --protein --threads 24 --use-ftp

Output (before and after editing the rsync_from_ncbi.pl script, as above):

Downloading taxonomy tree data... done.
Untarring taxonomy tree data... done.
Step 1/2: Performing ftp file transfer of requested files
Step 2/2: Assigning taxonomic IDs to sequences
Processed 377 projects (934707 sequences, 269.17 Maa)... done.
All files processed, cleaning up extra sequence files... done, library complete.
Masking low-complexity regions of downloaded library... done.
rsync_from_ncbi.pl: unexpected FTP path (new server?) for na

So I have gone with the non-redundant protein database for now:

kraken2-build --download-library nr --db bac_protein --use-ftp --protein
kraken2-build --db bac_protein/ --download-taxonomy --use-ftp
kraken2-build --build --db bac_protein/ --threads 10

Output:

Creating sequence ID to taxonomy ID map (step 1)...
Found 7913/573044508 targets, searched through 753265413 accession IDs, search complete.
lookup_accession_numbers: 573036595/573044508 accession numbers remain unmapped, see unmapped.txt in DB directory
Sequence ID to taxonomy ID map complete. [1h31m43.606s]
Estimating required capacity (step 2)...
Estimated hash table requirement: 1460 bytes
Capacity estimation complete. [3m11.010s]
Building database files (step 3)...
Taxonomy parsed and converted.
CHT created with 11 bits reserved for taxid.
Processed 1311 sequences (377139 bp)...  

I let this run for ~5 days and this is as far as it got. Given the low numbers of classified sequences anyway, maybe this isn't the way to go; I will try putting together a larger database instead.

Download files (using the scripts at this repository, edited by me to download '_protein.faa.gz' rather than '_genomic.fna.gz' files):

kraken2-build --download-taxonomy --db kraken2_protein/ --threads 12 --use-ftp
bacteria
archaea
viral
human
fungi
protozoa

#For each of the domains

perl download_bacteria_faa.pl 

for file in bacteria/*.faa
do
    kraken2-build --add-to-library $file --db kraken2_protein
done

#And then

kraken2-build --build --db kraken2_protein --threads 12 --protein

These each crashed at some point before downloading all of the files, and I don't know enough Perl to change the scripts reliably.

Build Kraken2 protein database (successful code)

I wrote a Python script with the same steps as the Perl scripts. At the moment I have added only complete genomes, but this could be extended to all genomes at some point if useful (for bacteria that is a huge amount more data, though, and I'm not sure how much it would change species assignments). This script is saved at this repository.
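The script itself isn't reproduced here, but the core of it can be sketched as below. This is my own minimal reconstruction (the function name and URL handling are assumptions): it reads an NCBI assembly_summary.txt, keeps only complete genomes, and builds the URLs of the matching '_protein.faa.gz' files.

```python
import pandas as pd

def protein_urls(assembly_summary, assembly_level='Complete Genome'):
    """Build download URLs for *_protein.faa.gz files of complete genomes.

    assembly_summary is the path (or file handle) of an NCBI
    assembly_summary.txt; these files have one comment line before the
    header line, hence header=1.
    """
    df = pd.read_csv(assembly_summary, sep='\t', header=1, low_memory=False)
    df = df[df['assembly_level'] == assembly_level]
    urls = []
    for ftp in df['ftp_path']:
        basename = ftp.rstrip('/').split('/')[-1]  # e.g. GCF_..._ASM...
        urls.append(ftp.rstrip('/') + '/' + basename + '_protein.faa.gz')
    return urls
```

Each URL can then be fetched (e.g. with urllib.request.urlretrieve) before running kraken2-build --add-to-library on the unzipped files.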

And then:

#For each domain
for file in bacteria/*.faa
do
    kraken2-build --add-to-library $file --db kraken2_protein --threads 12
done
mv viral added_kraken2_protein

kraken2-build --build --db kraken2_protein --threads 12 --protein

So the database includes:
- 362 archaea (total available including not complete = 1082)
- 17,836 bacteria (total available including not complete = 190,989)
- 11 fungi (total available including not complete = 328)
- Human genome
- 4 protozoa (total available including not complete = 94)
- 9,756 viruses (total available including not complete = 10,153)

Classify the peptides

Kraken2

This is the code used to run Kraken2 using the fasta file created above and the protein database constructed above (with and without the confidence parameter).

kraken2 --use-names --threads 12 --db databases/kraken/kraken2_protein/ --memory-mapping peptides/peptide_sequences.fasta --output peptides/kraken2_out/peptides.kraken --report peptides/kraken2_out/peptides.kreport --confidence 0.1

No hits were found:

Loading database information... done.
1286515 sequences (11.99 Mbp) processed in 0.359s (214973.9 Kseq/m, 2002.90 Mbp/m).
  0 sequences classified (0.00%)
  1286515 sequences unclassified (100.00%)

BLASTp-short on all peptides

I have then combined the protein sequences downloaded from NCBI above into one .fasta file and made a BLAST database with them:

cat fungi/*.faa > combined_fungi.faa
cat protozoa/*.faa > combined_protozoa.faa
cat archaea/*.faa > combined_archaea.faa
cat viral/*.faa > combined_viral.faa
cat bacteria/*.faa > combined_bacteria.faa

mkdir combined_faa
mv combined_fungi.faa combined_faa
mv combined_protozoa.faa combined_faa
mv combined_archaea.faa combined_faa
mv combined_bacteria.faa combined_faa
mv combined_viral.faa combined_faa
cp vertebrate_mammalian/GCF_000001405.39_GRCh38.p13_protein.tax.faa combined_faa/

cat combined_faa/*.faa > combined_all.faa

makeblastdb -in combined_all.faa -dbtype prot

I am then running blastp (the blastp-short task, which is optimized for query sequences shorter than 30 residues) against this database (copied to a ramdisk; this took ~10 days to run):

sudo mount -t ramfs none /scratch/ramdisk/
sudo cp -a protein_db_ncbi/ /scratch/ramdisk/

blastp -task blastp-short -query peptide_sequences.fasta -db /scratch/ramdisk/protein_db_ncbi/combined_all.faa -out all_peptides.out -num_threads 12 -outfmt '10 qseqid sseqid pident evalue bitscore  length mismatch positive gaps'

I end up with some kind of match for roughly 1 in 10 peptides. Note that these all have high E values (even when the identity is 100%), but I assume this is just because the peptides are too short for the E value to be meaningful.
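As a rough sanity check on that assumption, the Karlin-Altschul formula E = K*m*n*exp(-lambda*S) (query length m, database length n, raw score S) shows why even a perfect ~10-residue match keeps a large E value. The lambda and K below are the published gapped BLOSUM62 estimates, and since blastp-short actually uses PAM30 scoring, the numbers are only illustrative:

```python
import math

def evalue(raw_score, query_len, db_len, lam=0.267, K=0.041):
    """Karlin-Altschul expected number of chance hits: E = K*m*n*exp(-lam*S)."""
    return K * query_len * db_len * math.exp(-lam * raw_score)

# ~5 per residue is a rough per-residue score for an exact BLOSUM62 match
short_E = evalue(raw_score=50, query_len=10, db_len=2e8)    # perfect 10-mer
long_E = evalue(raw_score=500, query_len=100, db_len=2e8)   # perfect 100-mer
```

Against a database of ~2x10^8 residues the perfect 10-mer still has an E value well above 1, while the 100-mer's E value is vanishingly small: short queries simply cannot accumulate enough score for a small E value.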

BLASTp-short (basic processing)

Combine assembly summaries to get NCBI taxon IDs:

domains = ['archaea', 'bacteria', 'fungi', 'protozoa', 'vertebrate_mammalian', 'viral']
path = '/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/download_domain_test/'
domain_list = []
for d in range(len(domains)):
  this_dom = pd.read_csv(path+domains[d]+'/assembly_summary.txt', sep='\t', header=1, index_col=5)
  cols = list(this_dom.columns)
  cols.remove('organism_name')
  this_dom.drop(cols, axis=1, inplace=True)
  this_dom['domain'] = [domains[d] for x in range(this_dom.shape[0])]
  print(this_dom)
  domain_list.append(this_dom)
all_domains = pd.concat(domain_list)
all_domains.index = all_domains.index.map(str)

all_domains.to_csv('/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/all_domains.csv')

Make dictionaries of the peptide and sample names:

used_fasta = '/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/peptide_sequences_used.fasta'
used_dict = {}

for record in SeqIO.parse(used_fasta, "fasta"):
    used_dict[record.id] = [record.description, record.seq]

with open('/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/rename_peptides_original.dict', 'wb') as f:
    pickle.dump(used_dict, f)
    
sample_dict = {}
table = '/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/wetransfer-6c2cc0/denovo_peptide_bacteria.tsv'

table = pd.read_csv(table, sep='\t', header=0, index_col=0)

rows = table.index.values
for r in rows:
    if r in sample_dict: continue
    sample_dict[r] = table.loc[r, 'Sample Name']

with open('/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/sample_dict.dict', 'wb') as f:
    pickle.dump(sample_dict, f)

Remove peptides for which the matches are below 100% identity:

peptides = pd.read_csv('/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/all_peptides_out.csv', header=0, index_col=0)
peptides = peptides.loc[peptides["pident "] == 100]
peptides.to_csv('/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/all_peptides_out_reduced.csv')

Add taxa assignments to peptides table:

all_domains = pd.read_csv('/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/all_domains.csv', index_col=0, header=0)

peptides = pd.read_csv('/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/all_peptides_out_reduced.csv', header=0)
peptides['organism_name'] = ['NA' for x in range(peptides.shape[0])]
peptides['genus'] = ['NA' for x in range(peptides.shape[0])]
peptides['domain'] = ['NA' for x in range(peptides.shape[0])]
peptides['peptide_name'] = ['NA' for x in range(peptides.shape[0])]
peptides['sample_name'] = ['NA' for x in range(peptides.shape[0])]
peptides['sequence'] = ['NA' for x in range(peptides.shape[0])]

with open('/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/rename_peptides_original.dict', 'rb') as f:
  used_dict = pickle.load(f)
  
table = '/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/wetransfer-6c2cc0 2/denovo_peptide_bacteria.tsv'
sample_names = pd.read_csv(table, sep='\t', header=0, index_col=0)
sample_names_peptide = pd.read_csv(table, sep='\t', header=0, index_col=1)
sample_names_list = ['sample_name']
ma = 0

rows = list(peptides.index.values)
for a in range(len(rows)):
  seq = peptides.loc[rows[a], 'sseqid ']
  tid = int(seq.split('taxid|')[1])
  names = all_domains.loc[tid, :].values
  if not isinstance(names[0], str): 
      names = names[0]
  peptides.loc[rows[a], 'organism_name'] = names[0]
  peptides.loc[rows[a], 'domain'] = names[1]
  peptides.loc[rows[a], 'genus'] = names[0].split(' ')[0]
  pep = used_dict[peptides.loc[rows[a], 'qseqid ']]
  pep_name = pep[0].split(' ')[1]
  peptides.loc[rows[a], 'peptide_name'] = pep_name
  samples = sample_names_peptide.loc[pep_name, 'Sample Name'].values
  if len(samples) > ma:
    prev_ma = int(ma)
    ma = len(samples)
    for c in range(prev_ma, ma):
      sample_names_list.append('sample_name'+str(c+2))
      peptides['sample_name'+str(c+2)] = ['NA' for x in range(peptides.shape[0])]
  for b in range(len(samples)):
      peptides.loc[rows[a], sample_names_list[b]] = samples[b]
  peptides.loc[rows[a], 'sequence'] = str(pep[1])

    
peptides.to_csv('/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/all_peptides_out_taxa_3.csv')

Sort the table into taxa assignments for the peptides and an abundance table of samples vs peptides (here we check whether all classifications of each peptide have the same organism name, or at least the same genus; otherwise the peptide is removed):

taxa = pd.read_csv('/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/all_peptides_out_taxa_4.csv', header=0, index_col=13)
taxa.drop(['qseqid ', 'Unnamed: 0.1'], axis=1, inplace=True)
taxa.drop_duplicates(keep='first', inplace=True)

unique_taxa_dict = {}
unique_taxa = list(set(taxa.index.values))
samples_all = []
all_samples_table = []
unique_taxa_added = []
domain_dict = {}

count = 0
for ut in unique_taxa:
    count += 1
    adding = False
    #if count > 200: continue
    if isinstance(taxa.loc[ut, 'organism_name'], str):
        genus_adding = taxa.loc[ut, 'organism_name']
        domain_dict[genus_adding] = taxa.loc[ut, 'domain']
        adding = True
    else:
        this_tax = taxa.loc[ut, 'organism_name'].values
        domain = taxa.loc[ut, 'domain'].values
        if len(list(set(this_tax))) == 1:
            genus_adding = list(set(this_tax))[0]
            domain_dict[genus_adding] = domain[0]
            adding = True
        else:
            this_genus = taxa.loc[ut, 'genus'].values
            if len(list(set(this_genus))) == 1:
                genus_adding = list(set(this_genus))[0]
                domain_dict[genus_adding] = domain[0]
                adding = True
    if adding:
        samples = list(taxa.loc[ut, :].values)
        this_peptide_samples = []
        if isinstance(samples[0], np.int64):
            for a in range(len(samples)):
                if isinstance(samples[a], str):
                    if 'sample' in samples[a]:
                        this_peptide_samples.append(samples[a])
        else:
            for b in range(len(samples)):
                if isinstance(samples[0], str):
                    if isinstance(samples[b], str):
                        if 'sample' in samples[b]:
                            this_peptide_samples.append(samples[b])
                else:
                    for c in range(len(samples[b])):
                        if isinstance(samples[b][c], str):
                            if 'sample' in samples[b][c]:
                                this_peptide_samples.append(samples[b][c])
        if this_peptide_samples == []:
            adding = False
    if adding:
        samples_all.append([ut, this_peptide_samples])
        unique_taxa_dict[ut] = genus_adding
        for a in range(len(this_peptide_samples)):
            if this_peptide_samples[a] not in all_samples_table:
                all_samples_table.append(this_peptide_samples[a])
        unique_taxa_added.append(ut)

print(len(unique_taxa_dict), len(samples_all))

feat_table = pd.DataFrame(columns=all_samples_table, index=unique_taxa_added).fillna(0)

for a in range(len(samples_all)):
    for b in range(len(samples_all[a][1])):
        feat_table.loc[samples_all[a][0], samples_all[a][1][b]] = feat_table.loc[samples_all[a][0], samples_all[a][1][b]] + 1

feat_table.to_csv('/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/feature_table.csv')
feat_table.insert(0, 'taxa', ['NA' for x in range(feat_table.shape[0])])
feat_table.insert(1, 'domain', ['NA' for x in range(feat_table.shape[0])])
for pep in feat_table.index.values:
    feat_table.loc[pep, 'taxa'] = unique_taxa_dict[pep]
    feat_table.loc[pep, 'domain'] = domain_dict[unique_taxa_dict[pep]]
feat_table.to_csv('/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/feature_table_full_w_tax.csv')
ft_grouped = feat_table.rename(index=unique_taxa_dict)
ft_grouped = ft_grouped.drop(['domain', 'taxa'], axis=1)
ft_grouped = ft_grouped.groupby(by=ft_grouped.index, axis=0).sum()
ft_grouped.insert(0, 'domain', ['NA' for x in range(ft_grouped.shape[0])])
for tax in ft_grouped.index.values:
    ft_grouped.loc[tax, 'domain'] = domain_dict[tax]
ft_grouped.to_csv('/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/feature_table_tax.csv')

This reduces us from 42,776 peptides with classifications (of 1,286,515 original peptides) to 36,811, and gives an average of 214 peptides classified per sample.

HMMER with Pfam database

Get Pfam database:

wget ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz

Run HMMER:

#--tblout and --domtblout each need an output file name
hmmsearch --tblout Pfam_hmm.tbl --domtblout Pfam_hmm.domtbl Pfam-A.hmm peptide_sequences.fasta > Pfam_hmm.out

This gives hits (although probably none that are over the inclusion threshold; they all have very high E values, likely because the peptides are very short), but I haven't looked at them yet. I'm waiting to see whether this is something that we are interested in first.

Basic results

To summarise, up to this point we have:
  • Classified all peptides using BLASTp-short against a protein database made from all complete genomes in the NCBI reference database. This gives classifications for 42,776 of the original 1,286,515 peptides
  • Removed classifications that match with below 100% identity (note that all E values are very high, even at 100% identity; I think the E values are simply not meaningful for peptides this short)
  • Removed classifications where multiple hits for the same peptide returned different taxa (where multiple hits agreed at the genus level, I have classified to genus)

This leaves us with 36,811 unique peptides (an average of 214 per sample) classified to six domains (i.e. bacteria, fungi, archaea, viral, protozoa and vertebrate mammalian (human)) with 6180 lowest classifications (genera or species) or 1770 genera.

Taxa classified

The overall proportions of taxa classified at the domain, genus and species levels. Here we have removed genera with <0.2% relative abundance and species (this also includes genera, as it is just the lowest possible classification) with <0.15% relative abundance. Note that the choice of 0.2%/0.15% is arbitrary and based on giving good visualisation results. It may be a result in itself that no genus accounts for a large proportion of the classifications; e.g. removing all genera with <1% relative abundance leaves us with only 6 genera: Aspergillus (2%), Bacillus (1.2%), Streptomyces (2.4%), Pseudomonas (2.4%), Homo (4.6%) and Paenibacillus (1.3%).

Colors are cycled; labels that appear first in the legend are plotted at the bottom of each bar (starting at 0), while those that appear last are at the top.
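The get_cols helper used in these plots isn't defined in this document. A minimal sketch, assuming it simply cycles through a qualitative matplotlib colormap (the choice of tab20 is my own):

```python
import matplotlib as mpl

def get_cols(n, cmap=mpl.cm.tab20):
    """Return n RGBA colors, cycling through the colormap if n > cmap.N."""
    return [cmap(i % cmap.N) for i in range(n)]
```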

feat_table_full = pd.read_csv('/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/feature_table_full_w_tax.csv', header=0, index_col=0)
species = list(feat_table_full.loc[:, 'taxa'].values)
domain = list(feat_table_full.loc[:, 'domain'].values)
genera = list(feat_table_full.loc[:, 'taxa'].values)
for g in range(len(genera)):
  try:
    genera[g] = genera[g].split(' ')[0]
  except AttributeError:  #non-string entries (e.g. NaN) are left as they are
    pass

domain_unique, species_unique, genera_unique = list(set(domain)), list(set(species)), list(set(genera))

fig = plt.figure(figsize=(15,20))
ax1 = plt.subplot(291)
ax2 = plt.subplot(293)
ax3 = plt.subplot(297)

domain_plot = []
for a in domain_unique:
  domain_plot.append(domain.count(a))
total = sum(domain_plot)
domain_plot = [(a/total)*100 for a in domain_plot]

genera_plot = []
for a in genera_unique:
  genera_plot.append(genera.count(a))
total = sum(genera_plot)
genera_plot = [(a/total)*100 for a in genera_plot]

species_plot = []
for a in species_unique:
  species_plot.append(species.count(a))
total = sum(species_plot)
species_plot = [(a/total)*100 for a in species_plot]

for a in range(len(genera_plot)):
  genera_plot[a] = [genera_unique[a], genera_plot[a]]

for a in range(len(species_plot)):
  species_plot[a] = [species_unique[a], species_plot[a]]

genera = pd.DataFrame(genera_plot, columns=['Genus', 'Count'])
genera.set_index('Genus', inplace=True)
genera = genera[genera.max(axis=1) > 0.2]
genera.sort_values(by='Count', ascending=False, inplace=True)

species = pd.DataFrame(species_plot, columns=['Species', 'Count'])
species.set_index('Species', inplace=True)
species = species[species.max(axis=1) > 0.15]
species.sort_values(by='Count', ascending=False, inplace=True)
print(species)
domain_colors = get_cols(len(domain_plot))
bottom=0
for a in range(len(domain_plot)):
  ax1.bar(1, domain_plot[a], bottom=bottom, color=domain_colors[a], label=domain_unique[a].split('_')[0], edgecolor='k')
  bottom += domain_plot[a]
ax1.legend(loc='upper left', bbox_to_anchor=(1, 1.02))
ax1.set_ylim([0, 100])
ax1.set_ylabel('Relative abundance (%)')
ax1.set_xticklabels([])
genera_colors = get_cols(genera.shape[0])
bottom=0
for a in range(len(list(genera.index.values))):
  name = list(genera.index.values)[a]
  ax2.bar(1, genera.loc[name, 'Count'], bottom=bottom, color=genera_colors[a], label=name, edgecolor='k')
  bottom += genera.loc[name, 'Count']
ax2.legend(loc='upper left', bbox_to_anchor=(1, 1.02), ncol=2)
ax2.set_ylim([0, 43])
ax2.set_xticklabels([])
species_colors = get_cols(species.shape[0])
bottom=0
for a in range(len(list(species.index.values))):
  name = list(species.index.values)[a]
  ax3.bar(1, species.loc[name, 'Count'], bottom=bottom, color=species_colors[a], label=name, edgecolor='k')
  bottom += species.loc[name, 'Count']
ax3.legend(loc='upper left', bbox_to_anchor=(1, 1.02))
ax3.set_ylim([0, 14.5])
ax3.set_xticklabels([])
ax1.set_title('Domain')
ax2.set_title('Genus')
ax3.set_title('Species')

plt.subplots_adjust(wspace=0.5)
plt.show()

NMDS (all taxa, lowest classification)

All taxa classified to species where possible (genus if not). This NMDS is calculated based on Bray-Curtis distance between samples. Note that some samples did not appear in the ‘sample_centered_table.tsv’ and therefore have been classified as ‘Unknown’ in the plots. These samples haven’t been normalised in any way.

ft = pd.read_csv('/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/feature_table_tax.csv', index_col=0, header=0)

ft.drop(['domain'], axis=1, inplace=True)
ft = ft.transpose()

pos, npos, stress = transform_for_NMDS(ft)
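transform_for_NMDS is a helper defined elsewhere in my setup code. A minimal sketch of what it is assumed to do: non-metric MDS on the Bray-Curtis distances between samples, returning the raw positions, the positions rotated onto their principal axes, and the final stress.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

def transform_for_NMDS(df, seed=0):
    """Non-metric MDS (2D) on Bray-Curtis distances between the rows of df."""
    dist = squareform(pdist(df.values, metric='braycurtis'))
    nmds = MDS(n_components=2, metric=False, dissimilarity='precomputed',
               n_init=10, max_iter=500, random_state=seed)
    pos = nmds.fit_transform(dist)
    # rotate the embedding so that nMDS1 is the axis of greatest spread
    npos = pos @ np.linalg.svd(pos, full_matrices=False)[2].T
    return pos, npos, nmds.stress_
```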

Colored by disease:

sample_names = list(ft.index.values)
sample_colors, sample_markers = [], []
for a in sample_names:
  if a in disease_colors_samples:
    sample_colors.append(disease_colors_samples[a])
  else:
    sample_colors.append('gray')
  if a in type_markers_samples:
    sample_markers.append(type_markers_samples[a])
  else:
    sample_markers.append('s')

plt.figure(figsize=(25,12))
ax1 = plt.subplot(121)
for a in range(len(npos)):
  ax1.scatter(npos[a,0], npos[a,1], color=sample_colors[a], marker=sample_markers[a], s=100)
ax1.legend(handles=handles_type+handles_disease, loc='upper left', bbox_to_anchor=(1,1))
ax1.set_xlabel('nMDS1'), ax1.set_ylabel('nMDS2')
plt.tight_layout()
plt.show()

Colored by treatment:

sample_names = list(ft.index.values)
sample_colors, sample_markers = [], []
for a in sample_names:
  if a in treat_colors_samples:
    sample_colors.append(treat_colors_samples[a])
  else:
    sample_colors.append('gray')
  if a in type_markers_samples:
    sample_markers.append(type_markers_samples[a])
  else:
    sample_markers.append('s')

plt.figure(figsize=(25,12))
ax1 = plt.subplot(121)
for a in range(len(npos)):
  ax1.scatter(npos[a,0], npos[a,1], color=sample_colors[a], marker=sample_markers[a], s=100)
ax1.legend(handles=handles_type+handles_treat, loc='upper left', bbox_to_anchor=(1,1))
ax1.set_xlabel('nMDS1'), ax1.set_ylabel('nMDS2')
plt.tight_layout()
plt.show()

NMDS (all taxa, genus level)

All taxa classified to genus. This NMDS is calculated based on Bray-Curtis distance between samples. Note that some samples did not appear in the ‘sample_centered_table.tsv’ and therefore have been classified as ‘Unknown’ in the plots. These samples haven’t been normalised in any way.

ft = pd.read_csv('/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/feature_table_tax.csv', index_col=0, header=0)

ft.drop(['domain'], axis=1, inplace=True)
genus = {}

for a in ft.index.values:
  genus[a] = a.split(' ')[0]
ft.rename(index=genus, inplace=True)
ft = ft.groupby(ft.index.values).sum()

ft = ft.transpose()

pos, npos, stress = transform_for_NMDS(ft)

Colored by disease:

sample_names = list(ft.index.values)
sample_colors, sample_markers = [], []
for a in sample_names:
  if a in disease_colors_samples:
    sample_colors.append(disease_colors_samples[a])
  else:
    sample_colors.append('gray')
  if a in type_markers_samples:
    sample_markers.append(type_markers_samples[a])
  else:
    sample_markers.append('s')

plt.figure(figsize=(25,12))
ax1 = plt.subplot(121)
for a in range(len(npos)):
  ax1.scatter(npos[a,0], npos[a,1], color=sample_colors[a], marker=sample_markers[a], s=100)
ax1.legend(handles=handles_type+handles_disease, loc='upper left', bbox_to_anchor=(1,1))
ax1.set_xlabel('nMDS1'), ax1.set_ylabel('nMDS2')
plt.tight_layout()
plt.show()

Colored by treatment:

sample_names = list(ft.index.values)
sample_colors, sample_markers = [], []
for a in sample_names:
  if a in treat_colors_samples:
    sample_colors.append(treat_colors_samples[a])
  else:
    sample_colors.append('gray')
  if a in type_markers_samples:
    sample_markers.append(type_markers_samples[a])
  else:
    sample_markers.append('s')

plt.figure(figsize=(25,12))
ax1 = plt.subplot(121)
for a in range(len(npos)):
  ax1.scatter(npos[a,0], npos[a,1], color=sample_colors[a], marker=sample_markers[a], s=100)
ax1.legend(handles=handles_type+handles_treat, loc='upper left', bbox_to_anchor=(1,1))
ax1.set_xlabel('nMDS1'), ax1.set_ylabel('nMDS2')
plt.tight_layout()
plt.show()

NMDS (bacteria only, lowest classification)

All taxa classified to species where possible (genus if not). Any that are not classified as bacteria are removed. This NMDS is calculated based on Bray-Curtis distance between samples. Note that some samples did not appear in the ‘sample_centered_table.tsv’ and therefore have been classified as ‘Unknown’ in the plots. These samples haven’t been normalised in any way.

ft = pd.read_csv('/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/feature_table_tax.csv', index_col=0, header=0)

ft = ft.loc[ft['domain'] == 'bacteria'].fillna(value=0)

ft.drop(['domain'], axis=1, inplace=True)
ft = ft.transpose()

pos, npos, stress = transform_for_NMDS(ft)

Colored by disease:

sample_names = list(ft.index.values)
sample_colors, sample_markers = [], []
for a in sample_names:
  if a in disease_colors_samples:
    sample_colors.append(disease_colors_samples[a])
  else:
    sample_colors.append('gray')
  if a in type_markers_samples:
    sample_markers.append(type_markers_samples[a])
  else:
    sample_markers.append('s')

plt.figure(figsize=(25,12))
ax1 = plt.subplot(121)
for a in range(len(npos)):
  ax1.scatter(npos[a,0], npos[a,1], color=sample_colors[a], marker=sample_markers[a], s=100)
ax1.legend(handles=handles_type+handles_disease, loc='upper left', bbox_to_anchor=(1,1))
ax1.set_xlabel('nMDS1'), ax1.set_ylabel('nMDS2')
plt.tight_layout()
plt.show()

Colored by treatment:

sample_names = list(ft.index.values)
sample_colors, sample_markers = [], []
for a in sample_names:
  if a in treat_colors_samples:
    sample_colors.append(treat_colors_samples[a])
  else:
    sample_colors.append('gray')
  if a in type_markers_samples:
    sample_markers.append(type_markers_samples[a])
  else:
    sample_markers.append('s')

plt.figure(figsize=(25,12))
ax1 = plt.subplot(121)
for a in range(len(npos)):
  ax1.scatter(npos[a,0], npos[a,1], color=sample_colors[a], marker=sample_markers[a], s=100)
ax1.legend(handles=handles_type+handles_treat, loc='upper left', bbox_to_anchor=(1,1))
ax1.set_xlabel('nMDS1'), ax1.set_ylabel('nMDS2')
plt.tight_layout()
plt.show()

NMDS (bacteria only, genus level)

All taxa classified to genus. Any that are not classified as bacteria are removed. This NMDS is calculated based on Bray-Curtis distance between samples. Note that some samples did not appear in the ‘sample_centered_table.tsv’ and therefore have been classified as ‘Unknown’ in the plots. These samples haven’t been normalised in any way.

ft = pd.read_csv('/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/feature_table_tax.csv', index_col=0, header=0)

ft = ft.loc[ft['domain'] == 'bacteria'].fillna(value=0)

ft.drop(['domain'], axis=1, inplace=True)
genus = {}

for a in ft.index.values:
  genus[a] = a.split(' ')[0]
ft.rename(index=genus, inplace=True)
ft = ft.groupby(ft.index.values).sum()

ft = ft.transpose()

pos, npos, stress = transform_for_NMDS(ft)

Colored by disease:

sample_names = list(ft.index.values)
sample_colors, sample_markers = [], []
for a in sample_names:
  if a in disease_colors_samples:
    sample_colors.append(disease_colors_samples[a])
  else:
    sample_colors.append('gray')
  if a in type_markers_samples:
    sample_markers.append(type_markers_samples[a])
  else:
    sample_markers.append('s')

plt.figure(figsize=(25,12))
ax1 = plt.subplot(121)
for a in range(len(npos)):
  ax1.scatter(npos[a,0], npos[a,1], color=sample_colors[a], marker=sample_markers[a], s=100)
ax1.legend(handles=handles_type+handles_disease, loc='upper left', bbox_to_anchor=(1,1))
ax1.set_xlabel('nMDS1'), ax1.set_ylabel('nMDS2')
plt.tight_layout()
plt.show()

Colored by treatment:

sample_names = list(ft.index.values)
sample_colors, sample_markers = [], []
for a in sample_names:
  if a in treat_colors_samples:
    sample_colors.append(treat_colors_samples[a])
  else:
    sample_colors.append('gray')
  if a in type_markers_samples:
    sample_markers.append(type_markers_samples[a])
  else:
    sample_markers.append('s')

plt.figure(figsize=(25,12))
ax1 = plt.subplot(121)
for a in range(len(npos)):
  ax1.scatter(npos[a,0], npos[a,1], color=sample_colors[a], marker=sample_markers[a], s=100)
ax1.legend(handles=handles_type+handles_treat, loc='upper left', bbox_to_anchor=(1,1))
ax1.set_xlabel('nMDS1'), ax1.set_ylabel('nMDS2')
plt.tight_layout()
plt.show()

Abundance heatmaps

Genus heatmaps grouped by treatment

Unclustered

Here I have transformed samples to relative abundance, grouped them by treatment type (taking the mean of each sample grouping) and plotted the abundance of all genera above 0.5% relative abundance. I have removed samples that didn't appear in the 'sample_centered_table.tsv' file. Note that I haven't done anything to tidy up genera that don't follow a standard binomial 'Genus species' name. If a genus is present in a sample grouping, its relative abundance is shown in the cell.

ft = pd.read_csv('/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/feature_table_tax.csv', index_col=0, header=0)

ft.drop(['domain'], axis=1, inplace=True)
ft = ft.divide(ft.sum(axis=0), axis=1).multiply(100)
genus = {}

for a in ft.index.values:
  genus[a] = a.split(' ')[0]

ft.rename(index=genus, inplace=True)
ft = ft.groupby(ft.index.values).sum()
ft.rename(columns=sample_treat, inplace=True)
ft = ft.transpose()
ft = ft.groupby(ft.index.values).mean()
ft = ft.transpose()

keeping = []
for col in ft.columns:
  if 'sample' in col:
    keeping.append(False)
  else:
    keeping.append(True)

ft = ft.loc[:, keeping]
ft = ft[ft.max(axis=1) > 0.5]
ft = ft.iloc[::-1]

plt.figure(figsize=(10,40))
ax1 = plt.subplot(111)
colormap, norm = mpl.cm.get_cmap('hot', 256), mpl.colors.Normalize(vmin=0, vmax=5)
m = mpl.cm.ScalarMappable(norm=norm, cmap=colormap)

y = [1 for x in ft.columns]
bottom = [0 for x in ft.columns]
x = [x for x in range(len(ft.columns))]
all_y = []
for a in ft.index.values:
  row = ft.loc[a, :].values
  whole = list(row)
  strings = [str(round(b, 1)) for b in row]
  #ma = max(row)
  row = [m.to_rgba(b) for b in row]
  ax1.bar(x, y, color=row, bottom=bottom, edgecolor='k', width=1)
  for b in range(len(strings)):
    if whole[b] == 0: continue
    if whole[b] > 2.5: color='k'
    else: color='w'
    ax1.text(x[b], bottom[0]+0.5, strings[b], color=color, ha='center', va='center')
  bottom = [x+1 for x in bottom]
  all_y.append(bottom[0]-0.5)

ax1.set_xlim([x[0]-0.5, x[-1]+0.5])
ax1.set_ylim([0, bottom[0]])
plt.yticks(all_y, ft.index.values)
plt.xticks(x, ft.columns, rotation=90)
ax1.xaxis.tick_top()

plt.tight_layout()
plt.show()

Genus and treatment clustered

This shows the same as the previous heatmap grouped by treatment, but here the treatment groupings as well as the genera have been hierarchically clustered (average linkage on Bray-Curtis distances).
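The clustering step can be sketched in isolation: `scipy`'s `linkage` with `metric='braycurtis'` and average linkage builds the tree, and the dendrogram's `leaves` list (0-indexed) gives the order in which to rearrange the table so it lines up with the tree. The profile values here are invented for illustration:

```python
import numpy as np
from scipy.cluster import hierarchy

# Hypothetical abundance profiles: rows are groups, columns are genera
profiles = np.array([[5.0, 0.0, 1.0],
                     [4.5, 0.5, 1.0],
                     [0.0, 6.0, 2.0],
                     [0.1, 5.5, 2.0]])

# Average linkage on Bray-Curtis distances, as in the code below
Z = hierarchy.linkage(profiles, 'average', metric='braycurtis')

# no_plot=True just computes the layout; dn['leaves'] holds the
# 0-based row indices in left-to-right dendrogram order
dn = hierarchy.dendrogram(Z, no_plot=True)
order = dn['leaves']
reordered = profiles[order]
```

Rows 0 and 1 (and rows 2 and 3) have very similar profiles, so each pair ends up adjacent in the leaf order.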

ft = pd.read_csv('/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/feature_table_tax.csv', index_col=0, header=0)

ft.drop(['domain'], axis=1, inplace=True)
ft = ft.divide(ft.sum(axis=0), axis=1).multiply(100)
genus = {}

for a in ft.index.values:
  genus[a] = a.split(' ')[0]

ft.rename(index=genus, inplace=True)
ft = ft.groupby(ft.index.values).sum()
ft.rename(columns=sample_treat, inplace=True)
ft = ft.transpose()
ft = ft.groupby(ft.index.values).mean()
ft = ft.transpose()

keeping = []
for col in ft.columns:
  if 'sample' in col:
    keeping.append(False)
  else:
    keeping.append(True)

ft = ft.loc[:, keeping]
ft = ft[ft.max(axis=1) > 0.5]
ft = ft.iloc[::-1]

plt.figure(figsize=(20,50))
ax1 = plt.subplot2grid((50,4), (10,1), rowspan=40, colspan=2)
ax2 = plt.subplot2grid((50,4), (0,1), rowspan=6, colspan=2, frameon=False)
ax3 = plt.subplot2grid((50,4), (10,0), rowspan=40, frameon=False)
plt.sca(ax2)

Z = hierarchy.linkage(ft.transpose(), 'average', metric='braycurtis')
mpl.rcParams['lines.linewidth'] = 2
hierarchy.set_link_color_palette(['k'])
dn = hierarchy.dendrogram(Z, above_threshold_color='k', orientation='top')

# dendrogram leaves are 0-indexed; map them directly onto the columns
plot_labels = [ft.columns[i] for i in dn['leaves']]
plt.xticks([])
plt.yticks([])
ft = ft[plot_labels]

plt.sca(ax3)

Z = hierarchy.linkage(ft, 'average', metric='braycurtis')
mpl.rcParams['lines.linewidth'] = 2
hierarchy.set_link_color_palette(['k'])
dn = hierarchy.dendrogram(Z, above_threshold_color='k', orientation='left')

# dendrogram leaves are 0-indexed; map them directly onto the rows
plot_labels = [ft.index.values[i] for i in dn['leaves']]
plt.xticks([])
plt.yticks([])
plt.sca(ax1)
ft = ft.reindex(plot_labels)

colormap, norm = mpl.cm.get_cmap('hot', 256), mpl.colors.Normalize(vmin=0, vmax=5)
m = mpl.cm.ScalarMappable(norm=norm, cmap=colormap)

y = [1 for x in ft.columns]
bottom = [0 for x in ft.columns]
x = [x for x in range(len(ft.columns))]
all_y = []
for a in ft.index.values:
  row = ft.loc[a, :].values
  whole = list(row)
  strings = [str(round(b, 1)) for b in row]
  #ma = max(row)
  row = [m.to_rgba(b) for b in row]
  ax1.bar(x, y, color=row, bottom=bottom, edgecolor='k', width=1)
  for b in range(len(strings)):
    if whole[b] == 0: continue
    if whole[b] > 2.5: color='k'
    else: color='w'
    ax1.text(x[b], bottom[0]+0.5, strings[b], color=color, ha='center', va='center')
  bottom = [x+1 for x in bottom]
  all_y.append(bottom[0]-0.5)

ax1.set_xlim([x[0]-0.5, x[-1]+0.5])
ax1.set_ylim([0, bottom[0]])
plt.yticks(all_y, ft.index.values)
plt.xticks(x, ft.columns, rotation=90)
ax1.xaxis.tick_top()
ax1.yaxis.tick_right()

plt.subplots_adjust(wspace=0.05)
plt.tight_layout()
plt.show()

Genus sorted by prevalence

This shows the same as the first heatmap grouped by treatment, but here the treatment groupings have been hierarchically clustered (average linkage on Bray-Curtis distances) and the genera sorted by prevalence (the fraction of groupings in which each genus is present).
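Prevalence here is just the row mean of a presence/absence table: a genus scores 1 in each grouping where its abundance is non-zero, and the mean of those scores is the fraction of groupings in which it occurs. A sketch on a made-up genus-by-group table:

```python
import pandas as pd

# Hypothetical genus-by-group table of mean relative abundances
ft = pd.DataFrame({'g1': [2.0, 0.0, 0.3],
                   'g2': [1.5, 0.0, 0.0],
                   'g3': [0.9, 1.1, 0.2]},
                  index=['GenusA', 'GenusB', 'GenusC'])

# (ft > 0) is a boolean presence/absence table; its row mean is the
# prevalence, used as an ascending sort key (least prevalent first,
# matching the bottom-up plotting order)
prevalence = (ft > 0).mean(axis=1)
ft_sorted = ft.loc[prevalence.sort_values().index]
```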

ft = pd.read_csv('/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/feature_table_tax.csv', index_col=0, header=0)

ft.drop(['domain'], axis=1, inplace=True)
ft = ft.divide(ft.sum(axis=0), axis=1).multiply(100)
genus = {}

for a in ft.index.values:
  genus[a] = a.split(' ')[0]

ft.rename(index=genus, inplace=True)
ft = ft.groupby(ft.index.values).sum()
ft.rename(columns=sample_treat, inplace=True)
ft = ft.transpose()
ft = ft.groupby(ft.index.values).mean()
ft = ft.transpose()

keeping = []
for col in ft.columns:
  if 'sample' in col:
    keeping.append(False)
  else:
    keeping.append(True)

ft = ft.loc[:, keeping]
ft = ft[ft.max(axis=1) > 0.5]
ft = ft.iloc[::-1]

plt.figure(figsize=(20,50))
ax1 = plt.subplot2grid((50,4), (10,1), rowspan=40, colspan=2)
ax2 = plt.subplot2grid((50,4), (0,1), rowspan=6, colspan=2, frameon=False)
#ax3 = plt.subplot2grid((50,4), (10,0), rowspan=40, frameon=False)
plt.sca(ax2)

Z = hierarchy.linkage(ft.transpose(), 'average', metric='braycurtis')
mpl.rcParams['lines.linewidth'] = 2
hierarchy.set_link_color_palette(['k'])
dn = hierarchy.dendrogram(Z, above_threshold_color='k', orientation='top')

# dendrogram leaves are 0-indexed; map them directly onto the columns
plot_labels = [ft.columns[i] for i in dn['leaves']]
plt.xticks([])
plt.yticks([])
ft = ft[plot_labels]

plt.sca(ax1)

# presence/absence: 1 where a genus occurs in a group, 0 otherwise
pres_abs = (ft > 0).astype(int)
ft['prevalence'] = pres_abs.mean(axis=1)
ft = ft.sort_values(by='prevalence')
ft = ft.drop(['prevalence'], axis=1)

colormap, norm = mpl.cm.get_cmap('hot', 256), mpl.colors.Normalize(vmin=0, vmax=5)
m = mpl.cm.ScalarMappable(norm=norm, cmap=colormap)

y = [1 for x in ft.columns]
bottom = [0 for x in ft.columns]
x = [x for x in range(len(ft.columns))]
all_y = []
for a in ft.index.values:
  row = ft.loc[a, :].values
  whole = list(row)
  strings = [str(round(b, 1)) for b in row]
  #ma = max(row)
  row = [m.to_rgba(b) for b in row]
  ax1.bar(x, y, color=row, bottom=bottom, edgecolor='k', width=1)
  for b in range(len(strings)):
    if whole[b] == 0: continue
    if whole[b] > 2.5: color='k'
    else: color='w'
    ax1.text(x[b], bottom[0]+0.5, strings[b], color=color, ha='center', va='center')
  bottom = [x+1 for x in bottom]
  all_y.append(bottom[0]-0.5)

ax1.set_xlim([x[0]-0.5, x[-1]+0.5])
ax1.set_ylim([0, bottom[0]])
plt.yticks(all_y, ft.index.values)
plt.xticks(x, ft.columns, rotation=90)
ax1.xaxis.tick_top()
ax1.yaxis.tick_right()

plt.subplots_adjust(wspace=0.05)
plt.show()

Genus sorted by abundance

This shows the same as the first heatmap grouped by treatment, but here the treatment groupings have been hierarchically clustered (average linkage on Bray-Curtis distances) and the genera sorted by mean relative abundance across the groupings.
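The abundance sort key is the row mean across groupings; the temporary sort column is dropped again afterwards. On a made-up table:

```python
import pandas as pd

# Hypothetical genus-by-group table of mean relative abundances
ft = pd.DataFrame({'g1': [0.6, 3.0, 1.0],
                   'g2': [0.4, 2.0, 1.2]},
                  index=['GenusA', 'GenusB', 'GenusC'])

# Sort key: mean relative abundance across groups (ascending, so the
# most abundant genus is plotted last, i.e. at the top of the heatmap)
mean_abund = ft.mean(axis=1)
ft_sorted = ft.loc[mean_abund.sort_values().index]
```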

ft = pd.read_csv('/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/feature_table_tax.csv', index_col=0, header=0)

ft.drop(['domain'], axis=1, inplace=True)
ft = ft.divide(ft.sum(axis=0), axis=1).multiply(100)
genus = {}

for a in ft.index.values:
  genus[a] = a.split(' ')[0]

ft.rename(index=genus, inplace=True)
ft = ft.groupby(ft.index.values).sum()
ft.rename(columns=sample_treat, inplace=True)
ft = ft.transpose()
ft = ft.groupby(ft.index.values).mean()
ft = ft.transpose()

keeping = []
for col in ft.columns:
  if 'sample' in col:
    keeping.append(False)
  else:
    keeping.append(True)

ft = ft.loc[:, keeping]
ft = ft[ft.max(axis=1) > 0.5]
ft = ft.iloc[::-1]

plt.figure(figsize=(20,50))
ax1 = plt.subplot2grid((50,4), (10,1), rowspan=40, colspan=2)
ax2 = plt.subplot2grid((50,4), (0,1), rowspan=6, colspan=2, frameon=False)
#ax3 = plt.subplot2grid((50,4), (10,0), rowspan=40, frameon=False)
plt.sca(ax2)

Z = hierarchy.linkage(ft.transpose(), 'average', metric='braycurtis')
mpl.rcParams['lines.linewidth'] = 2
hierarchy.set_link_color_palette(['k'])
dn = hierarchy.dendrogram(Z, above_threshold_color='k', orientation='top')

# dendrogram leaves are 0-indexed; map them directly onto the columns
plot_labels = [ft.columns[i] for i in dn['leaves']]
plt.xticks([])
plt.yticks([])
ft = ft[plot_labels]

plt.sca(ax1)

# sort by mean relative abundance across groups (ascending)
ft['total'] = ft.mean(axis=1)
ft = ft.sort_values(by='total')
ft = ft.drop(['total'], axis=1)

colormap, norm = mpl.cm.get_cmap('hot', 256), mpl.colors.Normalize(vmin=0, vmax=5)
m = mpl.cm.ScalarMappable(norm=norm, cmap=colormap)

y = [1 for x in ft.columns]
bottom = [0 for x in ft.columns]
x = [x for x in range(len(ft.columns))]
all_y = []
for a in ft.index.values:
  row = ft.loc[a, :].values
  whole = list(row)
  strings = [str(round(b, 1)) for b in row]
  #ma = max(row)
  row = [m.to_rgba(b) for b in row]
  ax1.bar(x, y, color=row, bottom=bottom, edgecolor='k', width=1)
  for b in range(len(strings)):
    if whole[b] == 0: continue
    if whole[b] > 2.5: color='k'
    else: color='w'
    ax1.text(x[b], bottom[0]+0.5, strings[b], color=color, ha='center', va='center')
  bottom = [x+1 for x in bottom]
  all_y.append(bottom[0]-0.5)

ax1.set_xlim([x[0]-0.5, x[-1]+0.5])
ax1.set_ylim([0, bottom[0]])
plt.yticks(all_y, ft.index.values)
plt.xticks(x, ft.columns, rotation=90)
ax1.xaxis.tick_top()
ax1.yaxis.tick_right()

plt.subplots_adjust(wspace=0.05)
plt.show()

Genus heatmaps grouped by disease

Unclustered

Here I have transformed the samples to relative abundances, grouped them by disease type (taking the mean of each sample grouping) and plotted the abundance of all genera above 0.5% relative abundance. I have removed samples that didn’t appear in the ‘sample_centered_table.tsv’ file. Note that I haven’t done anything special about taxa that don’t follow the standard binomial Genus species naming convention. If a genus is present in a sample grouping, its relative abundance is shown in the cell.

ft = pd.read_csv('/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/feature_table_tax.csv', index_col=0, header=0)

ft.drop(['domain'], axis=1, inplace=True)
ft = ft.divide(ft.sum(axis=0), axis=1).multiply(100)
genus = {}

for a in ft.index.values:
  genus[a] = a.split(' ')[0]

ft.rename(index=genus, inplace=True)
ft = ft.groupby(ft.index.values).sum()
ft.rename(columns=sample_disease, inplace=True)
ft = ft.transpose()
ft = ft.groupby(ft.index.values).mean()
ft = ft.transpose()

keeping = []
for col in ft.columns:
  if 'sample' in col:
    keeping.append(False)
  else:
    keeping.append(True)

ft = ft.loc[:, keeping]
ft = ft[ft.max(axis=1) > 0.5]
ft = ft.iloc[::-1]

plt.figure(figsize=(15,60))
ax1 = plt.subplot(111)
colormap, norm = mpl.cm.get_cmap('hot', 256), mpl.colors.Normalize(vmin=0, vmax=5)
m = mpl.cm.ScalarMappable(norm=norm, cmap=colormap)

y = [1 for x in ft.columns]
bottom = [0 for x in ft.columns]
x = [x for x in range(len(ft.columns))]
all_y = []
for a in ft.index.values:
  row = ft.loc[a, :].values
  whole = list(row)
  strings = [str(round(b, 1)) for b in row]
  #ma = max(row)
  row = [m.to_rgba(b) for b in row]
  ax1.bar(x, y, color=row, bottom=bottom, edgecolor='k', width=1)
  for b in range(len(strings)):
    if whole[b] == 0: continue
    if whole[b] > 2.5: color='k'
    else: color='w'
    ax1.text(x[b], bottom[0]+0.5, strings[b], color=color, ha='center', va='center')
  bottom = [x+1 for x in bottom]
  all_y.append(bottom[0]-0.5)

ax1.set_xlim([x[0]-0.5, x[-1]+0.5])
ax1.set_ylim([0, bottom[0]])
plt.yticks(all_y, ft.index.values)
plt.xticks(x, ft.columns, rotation=90)
ax1.xaxis.tick_top()

plt.tight_layout()
plt.show()

Genus and disease clustered

This shows the same as the previous heatmap grouped by disease, but here the disease groupings as well as the genera have been hierarchically clustered (average linkage on Bray-Curtis distances).

ft = pd.read_csv('/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/feature_table_tax.csv', index_col=0, header=0)

ft.drop(['domain'], axis=1, inplace=True)
ft = ft.divide(ft.sum(axis=0), axis=1).multiply(100)
genus = {}

for a in ft.index.values:
  genus[a] = a.split(' ')[0]

ft.rename(index=genus, inplace=True)
ft = ft.groupby(ft.index.values).sum()
ft.rename(columns=sample_disease, inplace=True)
ft = ft.transpose()
ft = ft.groupby(ft.index.values).mean()
ft = ft.transpose()

keeping = []
for col in ft.columns:
  if 'sample' in col:
    keeping.append(False)
  else:
    keeping.append(True)

ft = ft.loc[:, keeping]
ft = ft[ft.max(axis=1) > 0.5]
ft = ft.iloc[::-1]

plt.figure(figsize=(25,70))
ax1 = plt.subplot2grid((70,5), (10,1), rowspan=60, colspan=3)
ax2 = plt.subplot2grid((70,5), (0,1), rowspan=5, colspan=3, frameon=False)
ax3 = plt.subplot2grid((70,5), (10,0), rowspan=60, frameon=False)
plt.sca(ax2)

Z = hierarchy.linkage(ft.transpose(), 'average', metric='braycurtis')
mpl.rcParams['lines.linewidth'] = 2
hierarchy.set_link_color_palette(['k'])
dn = hierarchy.dendrogram(Z, above_threshold_color='k', orientation='top')

# dendrogram leaves are 0-indexed; map them directly onto the columns
plot_labels = [ft.columns[i] for i in dn['leaves']]
plt.xticks([])
plt.yticks([])
ft = ft[plot_labels]

plt.sca(ax3)

Z = hierarchy.linkage(ft, 'average', metric='braycurtis')
mpl.rcParams['lines.linewidth'] = 2
hierarchy.set_link_color_palette(['k'])
dn = hierarchy.dendrogram(Z, above_threshold_color='k', orientation='left')

# dendrogram leaves are 0-indexed; map them directly onto the rows
plot_labels = [ft.index.values[i] for i in dn['leaves']]
plt.xticks([])
plt.yticks([])
plt.sca(ax1)
ft = ft.reindex(plot_labels)

colormap, norm = mpl.cm.get_cmap('hot', 256), mpl.colors.Normalize(vmin=0, vmax=5)
m = mpl.cm.ScalarMappable(norm=norm, cmap=colormap)

y = [1 for x in ft.columns]
bottom = [0 for x in ft.columns]
x = [x for x in range(len(ft.columns))]
all_y = []
for a in ft.index.values:
  row = ft.loc[a, :].values
  whole = list(row)
  strings = [str(round(b, 1)) for b in row]
  #ma = max(row)
  row = [m.to_rgba(b) for b in row]
  ax1.bar(x, y, color=row, bottom=bottom, edgecolor='k', width=1)
  for b in range(len(strings)):
    if whole[b] == 0: continue
    if whole[b] > 2.5: color='k'
    else: color='w'
    ax1.text(x[b], bottom[0]+0.5, strings[b], color=color, ha='center', va='center')
  bottom = [x+1 for x in bottom]
  all_y.append(bottom[0]-0.5)

ax1.set_xlim([x[0]-0.5, x[-1]+0.5])
ax1.set_ylim([0, bottom[0]])
plt.yticks(all_y, ft.index.values)
plt.xticks(x, ft.columns, rotation=90)
ax1.xaxis.tick_top()
ax1.yaxis.tick_right()

plt.subplots_adjust(wspace=0.05)
plt.show()

Genus sorted by prevalence

This shows the same as the first heatmap grouped by disease, but here the disease groupings have been hierarchically clustered (average linkage on Bray-Curtis distances) and the genera sorted by prevalence (the fraction of groupings in which each genus is present).

ft = pd.read_csv('/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/feature_table_tax.csv', index_col=0, header=0)

ft.drop(['domain'], axis=1, inplace=True)
ft = ft.divide(ft.sum(axis=0), axis=1).multiply(100)
genus = {}

for a in ft.index.values:
  genus[a] = a.split(' ')[0]

ft.rename(index=genus, inplace=True)
ft = ft.groupby(ft.index.values).sum()
ft.rename(columns=sample_disease, inplace=True)
ft = ft.transpose()
ft = ft.groupby(ft.index.values).mean()
ft = ft.transpose()

keeping = []
for col in ft.columns:
  if 'sample' in col:
    keeping.append(False)
  else:
    keeping.append(True)

ft = ft.loc[:, keeping]
ft = ft[ft.max(axis=1) > 0.5]
ft = ft.iloc[::-1]

plt.figure(figsize=(25,70))
ax1 = plt.subplot2grid((70,5), (10,1), rowspan=60, colspan=3)
ax2 = plt.subplot2grid((70,5), (0,1), rowspan=5, colspan=3, frameon=False)
#ax3 = plt.subplot2grid((70,5), (10,0), rowspan=60, frameon=False)
plt.sca(ax2)

Z = hierarchy.linkage(ft.transpose(), 'average', metric='braycurtis')
mpl.rcParams['lines.linewidth'] = 2
hierarchy.set_link_color_palette(['k'])
dn = hierarchy.dendrogram(Z, above_threshold_color='k', orientation='top')

# dendrogram leaves are 0-indexed; map them directly onto the columns
plot_labels = [ft.columns[i] for i in dn['leaves']]
plt.xticks([])
plt.yticks([])
ft = ft[plot_labels]


plt.sca(ax1)

# presence/absence: 1 where a genus occurs in a group, 0 otherwise
pres_abs = (ft > 0).astype(int)
ft['prevalence'] = pres_abs.mean(axis=1)
ft = ft.sort_values(by='prevalence')
ft = ft.drop(['prevalence'], axis=1)

colormap, norm = mpl.cm.get_cmap('hot', 256), mpl.colors.Normalize(vmin=0, vmax=5)
m = mpl.cm.ScalarMappable(norm=norm, cmap=colormap)

y = [1 for x in ft.columns]
bottom = [0 for x in ft.columns]
x = [x for x in range(len(ft.columns))]
all_y = []
for a in ft.index.values:
  row = ft.loc[a, :].values
  whole = list(row)
  strings = [str(round(b, 1)) for b in row]
  #ma = max(row)
  row = [m.to_rgba(b) for b in row]
  ax1.bar(x, y, color=row, bottom=bottom, edgecolor='k', width=1)
  for b in range(len(strings)):
    if whole[b] == 0: continue
    if whole[b] > 2.5: color='k'
    else: color='w'
    ax1.text(x[b], bottom[0]+0.5, strings[b], color=color, ha='center', va='center')
  bottom = [x+1 for x in bottom]
  all_y.append(bottom[0]-0.5)

ax1.set_xlim([x[0]-0.5, x[-1]+0.5])
ax1.set_ylim([0, bottom[0]])
plt.yticks(all_y, ft.index.values)
plt.xticks(x, ft.columns, rotation=90)
ax1.xaxis.tick_top()
ax1.yaxis.tick_right()

plt.subplots_adjust(wspace=0.05)
plt.show()

Genus sorted by abundance

This shows the same as the first heatmap grouped by disease, but here the disease groupings have been hierarchically clustered (average linkage on Bray-Curtis distances) and the genera sorted by mean relative abundance across the groupings.

ft = pd.read_csv('/Users/robynwright/Documents/OneDrive/Langille Lab postdoc/Peptides/feature_table_tax.csv', index_col=0, header=0)

ft.drop(['domain'], axis=1, inplace=True)
ft = ft.divide(ft.sum(axis=0), axis=1).multiply(100)
genus = {}

for a in ft.index.values:
  genus[a] = a.split(' ')[0]

ft.rename(index=genus, inplace=True)
ft = ft.groupby(ft.index.values).sum()
ft.rename(columns=sample_disease, inplace=True)
ft = ft.transpose()
ft = ft.groupby(ft.index.values).mean()
ft = ft.transpose()

keeping = []
for col in ft.columns:
  if 'sample' in col:
    keeping.append(False)
  else:
    keeping.append(True)

ft = ft.loc[:, keeping]
ft = ft[ft.max(axis=1) > 0.5]
ft = ft.iloc[::-1]

plt.figure(figsize=(25,70))
ax1 = plt.subplot2grid((70,5), (10,1), rowspan=60, colspan=3)
ax2 = plt.subplot2grid((70,5), (0,1), rowspan=5, colspan=3, frameon=False)
#ax3 = plt.subplot2grid((70,5), (10,0), rowspan=60, frameon=False)
plt.sca(ax2)

Z = hierarchy.linkage(ft.transpose(), 'average', metric='braycurtis')
mpl.rcParams['lines.linewidth'] = 2
hierarchy.set_link_color_palette(['k'])
dn = hierarchy.dendrogram(Z, above_threshold_color='k', orientation='top')

# dendrogram leaves are 0-indexed; map them directly onto the columns
plot_labels = [ft.columns[i] for i in dn['leaves']]
plt.xticks([])
plt.yticks([])
ft = ft[plot_labels]


plt.sca(ax1)

# sort by mean relative abundance across groups (ascending)
ft['total'] = ft.mean(axis=1)
ft = ft.sort_values(by='total')
ft = ft.drop(['total'], axis=1)

colormap, norm = mpl.cm.get_cmap('hot', 256), mpl.colors.Normalize(vmin=0, vmax=5)
m = mpl.cm.ScalarMappable(norm=norm, cmap=colormap)

y = [1 for x in ft.columns]
bottom = [0 for x in ft.columns]
x = [x for x in range(len(ft.columns))]
all_y = []
for a in ft.index.values:
  row = ft.loc[a, :].values
  whole = list(row)
  strings = [str(round(b, 1)) for b in row]
  #ma = max(row)
  row = [m.to_rgba(b) for b in row]
  ax1.bar(x, y, color=row, bottom=bottom, edgecolor='k', width=1)
  for b in range(len(strings)):
    if whole[b] == 0: continue
    if whole[b] > 2.5: color='k'
    else: color='w'
    ax1.text(x[b], bottom[0]+0.5, strings[b], color=color, ha='center', va='center')
  bottom = [x+1 for x in bottom]
  all_y.append(bottom[0]-0.5)

ax1.set_xlim([x[0]-0.5, x[-1]+0.5])
ax1.set_ylim([0, bottom[0]])
plt.yticks(all_y, ft.index.values)
plt.xticks(x, ft.columns, rotation=90)
ax1.xaxis.tick_top()
ax1.yaxis.tick_right()

plt.subplots_adjust(wspace=0.05)
plt.show()